The purpose of this project is to use predictive modeling to predict customer churn. Specifically, we aim to predict which customers of a bank will leave its credit card services. In addition, we will perform exploratory data analysis to discover relationships between the predictors and customer churn and to visualize the data.
Customer churn, also known as customer attrition, is the loss of customers. When a customer churns, they choose to stop using the business's products and/or services; in other words, the customer no longer interacts with the business. This, of course, directly affects the revenue the business earns.
Businesses measure churn as a percentage of customers lost, otherwise known as churn rate. Usually, this metric is tracked monthly and reported by the end of each month. It is important to note that churn rates vary by industry, as they are dependent on the type of market the company is in. As one could imagine, churn rate is a critical metric for businesses with subscription-based models, such as Netflix or Spotify.
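As a quick, hypothetical illustration (the counts here are made up), a monthly churn rate is simply the number of customers lost during the month divided by the number of customers at the start of it:

```r
# Hypothetical monthly churn rate calculation
customers_at_start <- 2000  # customers at the beginning of the month
customers_lost    <- 50     # customers who left during the month

churn_rate <- customers_lost / customers_at_start * 100
churn_rate  # 2.5 (percent)
```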
Naturally, some customer churn is inevitable, as some customers are going to leave or stop using an entity's products regardless. However, it generally costs more to acquire new customers than to retain existing ones, in part because existing customers are likely to spend more on a company's services or products. Therefore, if a company retains its existing customers, it spends less on the operating costs of acquiring new ones. Furthermore, it is less time-consuming to convince an existing customer to stay than to win over someone who has no relationship with the company.
In addition, reducing customer churn can add revenue to a business. A small churn rate can be an indicator of future growth. If a company is considering adding new products or services, the customers most likely to purchase them are existing ones, as they are already familiar with the company's brand. Plus, satisfied customers can positively impact a company's brand!
The bottom line is that customer churn rate is very important, especially if the rate is high or increasing. This is why it is helpful to build a predictive model of churn: to assess which customers will leave, and to take action based on the learned information and predictions.
Before we begin building our machine learning model, it is important to perform exploratory data analysis. Raw data usually needs to be pre-processed: cleaned or wrangled before it can be used for exploratory data analysis and model building. In this case, we need to clean our data, as it contains unnecessary variables and observations. We also have categorical predictors, which need to be converted to factors. Afterwards, we can use the cleaned data for the visualization and analysis performed in this section.
The dataset I will be working with was acquired from Kaggle through Analyttica, a data analytics company that provides learning resources. This dataset was most likely generated for learning purposes, as churn data from real businesses is incredibly difficult to find; most businesses prefer to keep this data private. However, this dataset should provide good practice for modeling churn, as it resembles real-life churn data.
From the description on Kaggle, this dataset covers customer churn at a bank that provides credit card services. A manager at the bank has noticed an increasing number of customers leaving the bank's credit card services. As a result, they would like to predict who is going to churn in order to reach out to those customers and convince them to stay.
This dataset contains around 10,000 observations and 23 variables. The full list can be viewed on Kaggle, but the key ones I'll be using include:
Attrition_Flag - Customer activity variable. If the account is closed then 1, else 0
Customer_Age - Demographic variable. Customer's age in years
Gender - Demographic variable. M = Male and F = Female
Dependent_count - Demographic variable. Number of dependents
Education_Level - Demographic variable. Education qualification of the account holder (example: high school, college graduate, etc.)
Marital_Status - Demographic variable. Married, Single, Divorced, Unknown
Income_Category - Demographic variable. Annual income category of the account holder (<$40K, $40K - $60K, $60K - $80K, $80K - $120K, $120K +, Unknown)
Card_Category - Product variable. Type of card (Blue, Silver, Gold, Platinum)
Months_on_book - Months on book (period of relationship)
Total_Relationship_Count - Total number of products held by the customer
Months_Inactive_12_mon - Number of months inactive in the last 12 months
Contacts_Count_12_mon - Number of contacts in the last 12 months
Credit_Limit - Credit limit on the credit card
Total_Revolving_Bal - Total revolving balance on the credit card
Avg_Open_To_Buy - Open to buy credit line (average of last 12 months)
Total_Amt_Chng_Q4_Q1 - Change in transaction amount (Q4 over Q1)
Total_Trans_Amt - Total transaction amount (last 12 months)
Total_Trans_Ct - Total transaction count (last 12 months)
Total_Ct_Chng_Q4_Q1 - Change in transaction count (Q4 over Q1)
Avg_Utilization_Ratio - Average card utilization ratio
# Load packages
library(tidyverse)
library(tidymodels)
library(dplyr)
library(MASS)
library(discrim)
library(corrr)
library(corrplot)
library(klaR)
library(ggplot2)
library(janitor)
library(glmnet)
library(xgboost)
library(randomForest)
library(rpart)
library(themis)
library(vip)
library(kernlab)
# Set seed for reproducibility
set.seed(1115)
# Load raw data
churn <- read.csv("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/BankChurners.csv")
We will perform a couple of cleaning steps in order to ensure our data is tidy. Afterwards, we will conduct a simple analysis of the data.
First, let us check the number of observations and variables.
# Check the number of rows and columns
dim(churn)
## [1] 10127 23
We have 10127 observations and 23 variables. However, we will drop the following features, as they are unnecessary for our predictive model.
CLIENTNUM - Client number. Unique identifier for the customer holding the account
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 - result of another analysis using Naive Bayes
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 - result of another analysis using Naive Bayes
# Remove unnecessary features
# Drop CLIENTNUM (the first column)
churn <- subset(churn, select = -1)
# Drop the two Naive Bayes classifier columns (columns 21 and 22 after the first drop)
churn <- subset(churn, select = -c(21, 22))
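An alternative, arguably safer, approach is to drop these columns by name rather than by position, so the code still works if the column order ever changes. A minimal sketch on a toy data frame (the long Naive Bayes column names are abbreviated here for readability):

```r
# Toy stand-in for the raw churn data
df <- data.frame(
  CLIENTNUM      = c(1, 2),
  Attrition_Flag = c("Existing Customer", "Attrited Customer"),
  Naive_Bayes_1  = c(0.9, 0.1),  # abbreviated stand-ins for the long column names
  Naive_Bayes_2  = c(0.1, 0.9)
)

# Drop the identifier and classifier-output columns by name
drop_cols <- c("CLIENTNUM", "Naive_Bayes_1", "Naive_Bayes_2")
df <- df[, !(names(df) %in% drop_cols), drop = FALSE]

names(df)  # "Attrition_Flag"
```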
We want clean names; in other words, we want our variable names to be all lowercase. This makes the variables easier to work with.
churn <- churn %>%
clean_names()
Now, we will check for any missing values.
sum(is.na(churn))
## [1] 0
We have no missing values! However, upon inspecting the dataset, three variables contain observations recorded as Unknown. These observations do not provide us with any information. Hence, I considered turning the Unknown observations into NA values and dropping them, but first, let's see how many Unknown observations there are in our variables.
sum(colSums(churn == "Unknown"))
## [1] 3380
There are 3380 Unknown observations, that's about a third of our data! I have decided to keep these values, as they only appear in the predictors education_level, marital_status, and income_category. Besides, I do not believe these variables will have a major impact on our prediction modeling compared to our other, more significant variables. Plus, since we are modeling churn, it is important to keep as many observations as we can, as our data is likely imbalanced.
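To see exactly which variables the Unknown values live in, we can count them per column. A sketch on toy data (on the real data this would be a per-column version of the colSums call above):

```r
# Toy data with Unknown entries in two of three columns
df <- data.frame(
  education_level = c("Graduate", "Unknown", "High School", "Unknown"),
  marital_status  = c("Married", "Single", "Unknown", "Married"),
  dependent_count = c(3, 5, 3, 4)
)

# Count "Unknown" in each character column
char_cols <- sapply(df, is.character)
unknown_counts <- colSums(df[, char_cols] == "Unknown")
unknown_counts  # education_level: 2, marital_status: 1
```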
Now, we want to factorize our categorical variables in order to correctly implement our statistical modeling.
churn <- churn %>%
mutate(attrition_flag = factor(
attrition_flag, levels = c("Existing Customer", "Attrited Customer")),
gender = factor(gender, levels = c("M","F")),
education_level = factor(education_level, levels = c("Uneducated","High School","College","Graduate","Post-Graduate","Doctorate", "Unknown")),
marital_status = factor(marital_status, levels = c("Single","Married","Divorced", "Unknown")),
# Rename first and last levels for the income category variable
# We want consistency in our variable names
#income_category = str_replace(income_category, "<$40K", "Less than $40K"),
#income_category = str_replace(income_category, ">$120K", "$120K +"),
income_category = replace(income_category, income_category == "Less than $40K", "<$40K"),
income_category = replace(income_category, income_category == "$120K +", ">$120K"),
income_category = factor(income_category, levels = c("<$40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", ">$120K", "Unknown")),
# Factorize card category
card_category = factor(card_category, levels = c("Blue","Silver","Gold","Platinum")))
Let’s display the internal structure of the data to ensure the required predictors are factors.
str(churn)
## 'data.frame': 10127 obs. of 20 variables:
## $ attrition_flag : Factor w/ 2 levels "Existing Customer",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ customer_age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ gender : Factor w/ 2 levels "M","F": 1 2 1 2 1 1 1 1 1 1 ...
## $ dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ education_level : Factor w/ 7 levels "Uneducated","High School",..: 2 4 4 2 1 4 7 2 1 4 ...
## $ marital_status : Factor w/ 4 levels "Single","Married",..: 2 1 2 4 2 2 2 4 1 1 ...
## $ income_category : Factor w/ 6 levels "<$40K","$40K - $60K",..: 3 1 4 1 3 2 5 3 3 4 ...
## $ card_category : Factor w/ 4 levels "Blue","Silver",..: 1 1 1 1 1 1 3 2 1 1 ...
## $ months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ total_relationship_count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ months_inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ contacts_count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ credit_limit : num 12691 8256 3418 3313 4716 ...
## $ total_revolving_bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ avg_open_to_buy : num 11914 7392 3418 796 4716 ...
## $ total_amt_chng_q4_q1 : num 1.33 1.54 2.59 1.41 2.17 ...
## $ total_trans_amt : int 1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
## $ total_trans_ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ total_ct_chng_q4_q1 : num 1.62 3.71 2.33 2.33 2.5 ...
## $ avg_utilization_ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
Great! Let’s take a look at the dimensions of our data once more:
# Check number of observations and variables
dim(churn)
## [1] 10127 20
We will be working with 10127 observations and 20 variables. Let’s now take a deep dive into our data with our EDA!
With our cleaned data, we can now build a machine learning model. However, before we proceed, it is important to visualize the data in order to make sense of it. This will involve developing visualizations and analyzing key variables as well as their relationships.
To start off our EDA, we will check the proportion of existing customers versus customers who have churned from the bank's credit card services. We will do this by creating a bar chart of our outcome variable, attrition_flag. Since we are working with churn data, I expect that only a small portion of customers have churned.
# Bar chart of outcome variable
churn %>%
ggplot(aes(x = attrition_flag)) +
geom_bar(fill = "#6DD6B1") +
labs(x = "Attrition Flag", y = "Count", title = "Attrition Flag Bar Plot")
As one can see, the outcome variable, attrition_flag, is highly imbalanced: the class proportions in the graph are heavily skewed, and only a small percentage of customers have churned, as expected.
This is important to keep in mind when we develop our machine
learning model later. We might need to use different performance metrics
for our imbalanced data, perhaps roc_auc instead of
accuracy. We might also need to try a couple data
pre-processing techniques that can deal with this situation, such as
up-sampling.
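One quick way to quantify the imbalance is to look at the class proportions directly. A sketch with made-up counts that roughly mirror the bar chart (on the real data this would be prop.table(table(churn$attrition_flag)); themis::step_upsample() added to a recipe is one way to rebalance the classes during training):

```r
# Toy outcome vector, roughly matching the imbalance in the bar chart
flag <- factor(c(rep("Existing Customer", 84), rep("Attrited Customer", 16)),
               levels = c("Existing Customer", "Attrited Customer"))

# Class proportions: ~84% existing vs ~16% attrited
prop.table(table(flag))
```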
Now, let’s create a correlation heat map of our numeric variables in order to see their relationships.
# Only include numerical values
churn_num <- churn %>%
select_if(is.numeric)
# Calculate the correlation of each numeric variable
churn_corr <- cor(churn_num)
# Correlation plot
churn_corr_plot <- corrplot(churn_corr)
Despite the large number of numerical predictors in our data, it is surprising to see so little correlation between most of the predictor variables. Upon closer inspection, though, the correlation heat map does make sense. For example, we wouldn't expect customer_age and total_revolving_bal to be correlated, since one shouldn't affect the other. There is, however, some correlation between certain variables. For instance, customer_age and months_on_book are strongly positively correlated, which is what we would expect. The same goes for the avg_open_to_buy and credit_limit predictors: the average open to buy is the average difference between the assigned credit limit and the customer's account balance, so it is logical for the two variables to be positively correlated.
Furthermore, predictor variables such as avg_utilization_ratio and credit_limit are negatively correlated with each other, since the smaller the credit limit, the higher the utilization ratio for any given purchase made with the card.
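If the definitions above hold in the data, open to buy should satisfy a simple identity: open_to_buy = credit_limit - revolving_balance (strictly, the dataset stores a 12-month average, so this need only hold approximately). Checking it against the values printed for the first three rows in the str() output below:

```r
# Values taken from the first three rows shown by str(churn)
credit_limit        <- c(12691, 8256, 3418)
total_revolving_bal <- c(777, 864, 0)
avg_open_to_buy     <- c(11914, 7392, 3418)

# Open to buy = credit limit - revolving balance
all(credit_limit - total_revolving_bal == avg_open_to_buy)  # TRUE
```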
We can also see a handful of predictor pairs with very small correlations, as shown by the graph; these correlations are so weak that the variables are hardly related. To illustrate, observe that total_trans_ct and total_relationship_count have a weak negative correlation in the plot.
For the rest of this section, we will be creating visualization plots
for our significant predictors in order to analyze their relationship
with our outcome variable, attrition_flag.
First, let's take a look at the distribution of customer age (our predictor customer_age). As a bank, it is important to observe customer demographics and see who has stayed with and who has left our credit card services.
# Distribution of customer age
churn %>%
ggplot(aes(customer_age)) +
geom_histogram(aes(fill = attrition_flag), bins = 95) +
# I'll be using these colors for our other graphs to make them look more vibrant!
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Customer Age", y = "Count", title = "Distribution of Customer Age", subtitle = "with Attrition Flag")
We can see that the distribution of customer ages in our dataset is approximately normal, with a mean age of about 46 years. The same holds when splitting by the attrition flag: the mean age of both existing and attrited customers is around 46 years.
Let’s check the proportion of customer genders.
ggplot(churn, aes(gender)) +
geom_bar(aes(fill = attrition_flag), color = "black") +
guides(fill = guide_legend(title = "Attrition Flag")) +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
labs(x = "Gender", y = "Count", title = "Proportion of Genders", subtitle = "with Attrition Flag")
Overall, the dataset is almost evenly split between males and females. There are slightly more female customers than male customers, and slightly more attrited females than attrited males. However, as mentioned, these differences are small, so gender does not appear to have a significant effect on customer churn status. We'll still include it in our model, though.
Now, we will take a look at the dependent count of our customers.
churn %>%
ggplot(aes(dependent_count)) +
geom_histogram(aes(fill = attrition_flag), bins = 16, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Dependent Count", y = "Count", title = "Histogram of Dependent Count", subtitle = "with Attrition Flag")
It's interesting to see that fewer customers churn past 2 dependents; as a result, the distribution of dependent counts has a slight right skew. Perhaps a higher number of dependents means higher expenditures, which would make the customer more likely to stay with the bank's credit card services as they rely on them more. Furthermore, a majority of customers have 2 or 3 dependents, and according to the histogram, those customers also churn the most. This may not provide a lot of information, but right now we are interested in gathering demographic information about the customers.
I don't really see how education level plays a part in whether a customer is going to churn or not. Nonetheless, knowing customer demographics can help the bank understand the characteristics of the people who use its credit card services. Here, we filter out the Unknown observations, as we can't draw any conclusions from them.
churn %>%
# Filter out Unknown observation as it does not provide information
filter(education_level != "Unknown") %>%
ggplot(aes(education_level)) +
geom_bar(aes(fill = attrition_flag), width = 0.25, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Education Level", y = "Count", title = "Bar Chart of Customer Education Level", subtitle = "with Attrition Flag") +
# Make it more readable
coord_flip()
Quite a diverse set of education levels! It appears that graduates make up the majority of the bank's clients, but they are also the ones who churn the most. I'm quite surprised by this observation; it makes me wonder whether this particular bank provides additional benefits or services for those with a graduate education or higher. Unfortunately this is just speculation, as we have no access to that kind of information. In general, a majority of the customers have a formal education, with a large share holding a higher degree (graduate and above).
I plan to use education_level in the bivariate plots shown further below in order to see if it has a true relationship with attrition_flag.
We will check the marital status of our customers. Note that in this case, “single” means never legally married, not to be confused with single after a divorce, which is different in terms of marital status.
churn %>%
# Filter out Unknown observations as it does not provide information
filter(marital_status != "Unknown") %>%
ggplot(aes(marital_status)) +
geom_bar(aes(fill = attrition_flag), width = 0.25, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Marital Status", y = "Count", title = "Bar Chart of Customers' Marital Status", subtitle = "with Attrition Flag")
Notice that almost half of the customers in the dataset are married. Interestingly enough, there is also a large number of single customers, despite the average customer having 2 or 3 dependents.
This is the last demographic variable we will look at. The (annual) income category is a key feature for our model. Income is directly correlated with credit limit (among other things, such as debt), and I presume it will also have an effect on the transaction amount and average utilization ratio of the customers' cards, which we will look at soon.
First, let's see the proportions of the income categories.
churn %>%
# Filter out Unknown observations as it does not provide any information
filter(income_category != "Unknown") %>%
ggplot(aes(income_category)) +
geom_bar(aes(fill = attrition_flag), width = 0.25, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Income Category", y = "Count", title = "Bar Chart of Customer Income Category", subtitle = "with Attrition Flag") +
# Make it more readable
coord_flip()
A majority of the customers have an income less than \(\$40,000\), with income category between \(\$40,000\) and \(\$80,000\) being the second highest. Furthermore, the lowest income category contains the highest number of churners.
Now, let’s see the proportion of each income category by education level. We will also wrap the attrition flag to check their relationship with our outcome variable. Given that the majority of customers are in the lowest income category, I wouldn’t be surprised if this proportion is the largest across all education levels.
churn %>%
filter(income_category != "Unknown" & education_level != "Unknown") %>%
ggplot(aes(education_level)) +
geom_bar(aes(fill = income_category), color = "black") +
facet_wrap(~attrition_flag) +
coord_flip() +
guides(fill = guide_legend(title = "Income Category")) +
labs(x = "Education Level", y = "Count", title = "Education Level by Income Category") +
scale_fill_brewer(palette = "BuGn")
churn %>%
# Filter out Unknown observations
filter(education_level != "Unknown" & income_category != "Unknown") %>%
ggplot(aes(income_category)) +
geom_bar(aes(fill = attrition_flag), color = "black", width = 1) +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
facet_wrap(~education_level, scales = "free_y") +
coord_flip() +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Income Category", y = "Count", title = "Number of Attrited and Non-Attrited Customers with Income Category by Education Level", subtitle = "with Attrition Flag") +
# y-axis and title are squished
theme(title = element_text(size = 9), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 6))
As expected, a majority of customers are in the lowest income category across education levels. Most of them have a graduate-level education and make less than \(\$40,000\); according to the bar graphs, these are also the customers who tend to churn the most. Furthermore, the distributions are approximately the same across education levels. This leads me to believe that education level doesn't have much influence on whether or not a customer churns. As a result, education_level will not be included in my model.
In summary of our demographic information, the majority of our customers are around 46 years of age, have a formal education, have between 2 to 3 dependents, make less than \(\$40,000\) annually, and are married.
The bank provides four different types of credit cards: blue, silver, gold, and platinum. Unfortunately, details of these cards and the benefits each one provides are not described, but given their names, I'd assume they're ordered by the benefits they provide, with platinum being the best. Nonetheless, let us see if we can still draw some conclusions from a visual analysis of the different cards.
churn %>%
ggplot(aes(card_category)) +
geom_bar(aes(fill = attrition_flag), width = 0.25, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Card Category", y = "Count", title = "Amount of Customers in Each Card Category", subtitle = "with Attrition Flag")
About 90% of customers own the blue credit card! Blue-card holders also account for the largest share of churned customers.
Let's see what this looks like across the different income categories. With 90% of customers holding a blue credit card, the blue card will obviously dominate every income category. However, I'm curious to see who has bought the other card types.
churn %>%
filter(income_category != "Unknown") %>%
ggplot(aes(card_category)) +
geom_bar(aes(fill = attrition_flag), width = 1, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
facet_wrap(~income_category, scales = "free_x") +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Card Category", y = "Count", title = "Number of Existing and Attrited Customers with Card Category", subtitle = "by Income Category") +
# y-axis is squished
theme(title = element_text(size=8), axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 6))
Again, it's not too surprising to see the blue card as the most popular card type across income categories, but it makes me wonder why. We see only a minuscule number of customers with the other card types, even in the higher income categories! Of course, there are many possible factors behind this observation, ranging from credit score and payment history to how long customers have been with the bank; the blue card might simply be the easiest one to apply for. However, we are not given this information, so we cannot speculate further.
Now, we are getting into the more quantitative side of our EDA!
Let’s plot the distribution of the number of months a customer has stayed with the bank’s credit card services.
churn %>%
ggplot(aes(months_on_book)) +
geom_histogram(aes(fill = attrition_flag),bins = 95, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Months on Book", y = "Count", title = "Distribution of Months on Book", subtitle = "with Attrition Flag")
Whoa! There is a huge spike at 36 months, so the most common tenure with the bank is 3 years. The spike may have occurred because 3 years is a round figure, or because 36 months was recorded as a default value, or because of some event in that period that we're unaware of, such as a massive ad campaign. Nonetheless, with a spike like that, I won't be including months_on_book in our models.
The total_relationship_count variable counts the number of products a customer holds with the bank, such as cards, bank accounts, and other similar products.
churn %>%
ggplot(aes(total_relationship_count)) +
geom_histogram(aes(fill = attrition_flag), bins = 11, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Relationship Count", y = "Count", title = "Distribution of Total Relationship Count", subtitle = "with Attrition Flag")
From the histogram, we see that most customers own 3 or more products from the bank. Customers with 3 or fewer products are more likely to churn than those who own more than 3. Perhaps customers who own more products have committed to the bank's services rather than looking elsewhere; in other words, those who own the most products have established themselves with the bank. Let's look at the medians to get a better idea of this.
churn %>%
ggplot(aes(total_relationship_count, attrition_flag, color = attrition_flag)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4) +
coord_flip() +
guides(color = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Relationship Count", y = "Attrition Flag", title = "Total Relationship Count Distribution", subtitle = "with Attrition Flag")
Notice that the boxplot for existing customers (in red) is right-skewed, with more mass at higher total relationship counts, meaning existing customers tend to hold more products. Again, this makes sense because those who own the most products from the bank have already established that they want to stay.
In this case, an inactive customer means that they are not actively making purchases with their card and have not used their account in some time.
churn %>%
ggplot(aes(months_inactive_12_mon)) +
geom_histogram(aes(fill = attrition_flag), bins = 7, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Months Inactive", y = "Count", title = "Distribution of Months Inactive", subtitle = "with Attrition Flag")
Most customers have between 1 and 3 months of inactivity, and those with 3 months of inactivity churn the most. Perhaps the bank should try contacting these inactive customers to win them back and make them feel they haven't been forgotten! It could also ask them how it could improve its services.
Here, we'll look at how much customers are using their cards; in other words, their average utilization ratios. Let's hope the majority of our customers are good with their finances and follow the 30% rule!
churn %>%
ggplot(aes(avg_utilization_ratio)) +
geom_histogram(aes(fill = attrition_flag), bins = 25, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Average Utilization Ratio", y = "Count", title = "Distribution of Average Utilization Ratio", subtitle = "with Attrition Flag")
Yikes! This is a right-skewed distribution with a peak at zero: roughly 2,500 customers have an average utilization ratio of zero, meaning they carry no revolving balance and rarely, if ever, use their card.
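The utilization ratio is essentially the revolving balance divided by the credit limit, which is why a zero balance produces a ratio of zero. A quick check using the values printed for the first rows in the str() output earlier:

```r
# Values from the first three rows shown by str(churn)
total_revolving_bal <- c(777, 864, 0)
credit_limit        <- c(12691, 8256, 3418)

# Utilization ratio = revolving balance / credit limit
round(total_revolving_bal / credit_limit, 3)  # 0.061 0.105 0.000, matching avg_utilization_ratio
```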
Let’s take a look at how the average utilization ratio relates to the total revolving balance of the customers’ cards. In this case we will check the correlations by each card type.
churn %>%
filter(income_category!= "Unknown") %>%
ggplot(aes(avg_utilization_ratio, total_revolving_bal)) +
# Overplotting
geom_point(alpha = 0.1) +
geom_smooth(method = "gam", se = FALSE, color = "red", linewidth = 2.0) +
facet_wrap(~card_category, scales = "free_x") +
labs(x = "Average Utilization Ratio", y = "Total Revolving Balance", title = "Average Utilization Ratio vs. Total Revolving Balance", subtitle = "by Card Type")
Observe that there is a positive correlation between avg_utilization_ratio and total_revolving_bal in each card category. This tells us that the more customers use their cards, the higher the unpaid amount carried into the next billing cycle. Additionally, customers who own a card other than the blue card tend to have a much lower average utilization ratio.
Next, let's look at the distribution of the average utilization ratio for each income category, by card type.
churn %>%
filter(education_level != "Unknown" & income_category != "Unknown") %>%
ggplot(aes(reorder(income_category, avg_utilization_ratio), avg_utilization_ratio)) +
geom_boxplot(varwidth = TRUE) +
coord_flip() +
facet_wrap(~card_category, scales = "free", ncol = 2) +
labs(
x = "Income Category",
y = "Average Utilization Ratio",
title = "Distribution of Average Utilization Ratio per Income Category",
subtitle = "Card Type"
)
Across each card type, the customers in the lowest income category have the largest average utilization ratio. Even more notable, the distributions across income level vary, as customers with higher levels of income tend to use their card less.
Let’s take a look at the customers’ assigned credit limits for their cards.
churn %>%
ggplot(aes(credit_limit)) +
geom_histogram(aes(fill = attrition_flag), bins = 80, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Credit Limit", y = "Count", title = "Distribution of Credit Limit", subtitle = "with Attrition Flag")
It's interesting to see the large number of outliers at the right end of our right-skewed distribution. Why is there a notable proportion of customers with a credit limit near \(\$35,000\)? My assumption is that most customers with such a large credit limit are the ones with higher incomes. We will check whether this assumption holds, but first, let's look at the median credit limit.
churn %>%
ggplot(aes(credit_limit, attrition_flag, color = attrition_flag)) +
geom_boxplot() +
coord_flip() +
guides(color = guide_legend(title = "Attrition Flag")) +
labs(x = "Credit Limit", y = "Attrition Flag", title = "Distribution of Credit Limit", subtitle = "with Attrition Flag")
For both existing and attrited customers, the median credit limit is almost \(\$5,000\). However, existing customers seem to have a slightly higher credit limit than attrited customers, although the difference is minuscule.
Now, let’s test my assumption that the high credit limit outliers are caused by customers in the higher income categories.
churn %>%
filter(income_category != "Unknown") %>%
ggplot(aes(fill = attrition_flag, credit_limit)) +
geom_histogram(bins = 30, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
facet_wrap(~income_category, scales = "free_x") +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(
x = "Credit Limit", y = "Count", title = "Histogram of Credit Limit by Income Category", subtitle = "with Attrition Flag"
)
As we can see, the distributions for the first three income categories are right-skewed, but once we reach the higher income categories, the distributions become left-skewed. There is a considerable number of people in the higher income categories with a very large credit limit, and they make up the outliers seen in our first credit limit graph. This makes sense: people with higher incomes tend to have good payment histories and pay off their debt, so the bank trusts them with a higher limit. Thus, our assumption is correct!
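This kind of long right tail is worth keeping in mind for our later data processing, where right-skewed variables get log-transformed. A minimal base-R sketch, using a hypothetical log-normal variable (not our actual credit-limit data), shows why the log transform helps:

```r
# Hypothetical right-skewed variable, similar in shape to our credit-limit distribution
set.seed(3)
x <- rlnorm(10000, meanlog = 8.5, sdlog = 0.6)
mean(x) > median(x)                  # TRUE: the long right tail pulls the mean above the median
abs(mean(log(x)) - median(log(x)))   # near zero: the logged variable is roughly symmetric
```

This is the same reasoning behind the `step_log()` steps we apply in our recipes later on.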
We’ll now perform an analysis between credit limit and the average utilization ratio. From the correlation plot we made earlier, we saw that there was a negative correlation between the two variables. Let’s see what this looks like across the different card types.
churn %>%
filter(income_category != "Unknown") %>%
ggplot(aes(avg_utilization_ratio, credit_limit)) +
# Use alpha to reduce overplotting
geom_point(alpha = 0.1) +
geom_smooth(method = "gam", se = FALSE, color = "red", linewidth = 2.0) +
facet_wrap(~card_category, scales = "free_x") +
labs(x = "Average Utilization Ratio", y = "Credit Limit", title = "Average Utilization Ratio vs. Credit Limit, by Card Category")
There seems to be a negative correlation between
credit_limit and average_utilization_ratio in
each card type. Therefore, customers with lower credit limits tend to
use their card more across the different card categories.
The total_trans_amt predictor measures the dollar amount
of customer transactions in the past 12 months.
churn %>%
ggplot(aes(total_trans_amt)) +
geom_histogram(aes(fill = attrition_flag), bins = 80, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Transaction Amount", y = "Count", title = "Distribution of Total Transaction Amount", subtitle = "with Attrition Flag")
Observe that total_trans_amt displays a multimodal distribution for existing customers, which means we have some underlying groups in our data. In contrast, total_trans_amt displays a right-skewed distribution for attrited customers. Therefore, we can say that existing customers spent more in the past 12 months than attrited customers. Let’s get a clearer look at this with a boxplot.
churn %>%
ggplot(aes(total_trans_amt, attrition_flag, color = attrition_flag)) +
geom_boxplot() +
coord_flip() +
guides(color = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Transaction Amount", y = "Attrition Flag", title = "Distribution of Total Transaction Amount", subtitle = "with Attrition Flag")
Let’s try to cluster our data to see if we can explain this multimodal distribution.
I’m going to assume that our underlying groups are composed of customers who have recently joined the bank and customers who have stayed with the bank for a while. First, I will create a scatter plot describing the relationship between total_trans_ct and total_trans_amt, distinguishing newer customers from those who have been with the bank longer. From the codebook, total_trans_ct is the customer’s transaction count over the past 12 months. Naturally, the correlation between the two variables will be positive, but we might find some clusters in the data.
We will categorize the newer customers as those who have been associated with the bank for 2 years or less. We will plot both the newer and older customers.
# We'll display the old customer and new customer plots in the same graphing space
churn %>%
ggplot(aes(total_trans_ct, total_trans_amt, color = attrition_flag)) +
geom_point(alpha = 0.5, show.legend = TRUE) +
guides(color = guide_legend(title = "Attrition Flag")) +
# Separate between new and old customers
# New customers have been with the bank for 2 or less years (<=24 months)
facet_wrap(~months_on_book <= 24, labeller = as_labeller(c(`FALSE` = "Old Customers", `TRUE` = "New Customers"))) +
labs(x = "Total Transaction Count", y = "Total Transaction Amount", title = "Old Customers vs. New Customers", subtitle = "by Attrition Flag") +
theme(title = element_text(size = 10))
Notice the 3 clusters in both plots above. The lower cluster is the largest, and it accounts for the two peaks in the total_trans_amt distribution. Older customers churned more, which explains the single peak in the total_trans_amt distribution for attrited customers.
By comparing the two cluster graphs, we also see that most newer customers fall in the lower cluster, and fewer in the higher clusters. This means that newer customers are spending less than older customers. Based on this observation, we can safely say that the distribution of total_trans_amt is explained by two underlying groups: the newer customers, and the customers who have been with the bank for more than 2 years. Looking back at the distribution for existing customers, the newer customers would represent the lower peak, and the older customers would represent the higher peak.
The total revolving balance is the unpaid amount that carries over to the customer’s next billing cycle. This statistic tells us how many customers don’t pay their credit card bill in full.
churn %>%
ggplot(aes(total_revolving_bal)) +
geom_histogram(aes(fill = attrition_flag), bins = 80, color = "black") +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
guides(fill = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Revolving Balance", y = "Count", title = "Distribution of Total Revolving Balance", subtitle = "with Attrition Flag")
The total revolving balance has a large spike at 0, but is otherwise centered around \(\$1,500\). Those with a revolving balance of 0 also seem to churn the most. Let’s obtain a clearer visual of the distribution with a boxplot.
churn %>%
ggplot(aes(total_revolving_bal, attrition_flag, color = attrition_flag)) +
geom_boxplot() +
coord_flip() +
guides(color = guide_legend(title = "Attrition Flag")) +
labs(x = "Total Revolving Balance", y = "Attrition Flag", title = "Distribution of Total Revolving Balance", subtitle = "with Attrition Flag")
We have an oddly shaped boxplot for the attrited customer group, but this makes sense given the high peak at 0. This means that customers with a lower revolving balance tend to churn the most. Perhaps increasing the credit limit of these customers is a possible solution, as it would encourage them to use their cards more. For existing customers, the total revolving balance is centered close to \(\$1,500\).
Let’s take a look at how the total revolving balance relates to credit limit. We’ll display each correlation by income category.
churn %>%
filter(income_category != "Unknown") %>%
ggplot(aes(credit_limit, total_revolving_bal, color = attrition_flag)) +
# Use alpha to reduce overplotting
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE, color = "purple", linewidth = 0.95, level = 0.99, fill = "purple") +
facet_wrap(~income_category, scales = "free_x") +
guides(color = guide_legend(title = "Attrition Flag")) +
scale_fill_manual(values = c("#2D7496", "#4BBC94")) +
labs(x = "Credit Limit", y = "Total Revolving Balance", title = "Credit Limit vs. Total Revolving Balance")
In the plots above, customers in the first three income categories tend to have increasing revolving balances as their credit limit gets higher. However, moving up the income categories, the slope decreases. Once you reach the two highest income categories, the correlation is almost 0, meaning customers with high incomes tend to have very similar revolving balances regardless of their credit limit.
Upon further inspection of the plots, there is a noticeable number of attrited customers in the \(\le\$40\)K income category with incredibly low revolving balances, but also really low credit limits. Perhaps this helps explain customer attrition: attrited customers who were assigned a low credit limit most likely churned to another bank for their credit card services, to see if they would be given a higher one. A possible solution would be to increase (to some acceptable degree) the credit limit among those in the lower income categories.
We finally get to the bread and butter of our project: building our machine learning models! Here, we will try out various machine learning techniques. Thanks to our extensive exploratory data analysis, we have a general idea of how most of our predictors will impact churn. Our model-building process will be to split the data, build our recipes, set up cross validation, fit and tune each model, and then compare their performance.
Before we do any model building, we need to set up the necessary data. This involves splitting our data, creating our recipe, and establishing cross validation. But first, let’s perform a final data clean-up step. I plan on using income_category in my model, so I will replace its Unknown observations with missing values that we will later impute. In addition, let’s encode our outcome variable.
# Re-level income_category; "Unknown" is not among the listed levels, so those observations become NA
churn <- churn %>%
mutate(income_category = factor(income_category, levels = c("<$40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", ">$120K")))
# Dummy code outcome variable
# 0 - Existing Customer
# 1 - Attrited Customer
churn <- churn %>%
mutate(attrition_flag = ifelse(attrition_flag == "Attrited Customer",1,0), attrition_flag = factor(attrition_flag, levels = c(0,1)))
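As a quick sanity check, here is the same ifelse + factor recoding logic applied to a tiny hypothetical vector (base R only, not our actual data):

```r
# Hypothetical mini example of the attrition_flag recoding
flags <- c("Existing Customer", "Attrited Customer", "Existing Customer")
coded <- factor(ifelse(flags == "Attrited Customer", 1, 0), levels = c(0, 1))
coded  # 0 1 0, with levels 0 and 1
```

Attrited customers map to 1 and existing customers to 0, with the factor levels ordered so that 0 is the first level.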
Let’s now continue on to prepare our models!
The most essential part of our model preparation is to split our data into training and testing sets. I will be splitting my data into an 80% training set and a 20% testing set. This is a commonly used ratio, and for good reason: the testing set is large enough, and our model has plenty to train on and learn from. We will also set a random seed here so that our training and testing splits are the same each time we re-run the code. It’s also a good idea to stratify on our outcome variable, attrition_flag, as it is highly imbalanced. This helps ensure that the class proportions in the training set match those in the testing set.
set.seed(1115) # Set seed so the split is the same
churn_split <- initial_split(churn, prop = 0.80, strata = attrition_flag) # Split the data, stratifying on the outcome variable
churn_train <- training(churn_split) # Training set
churn_test <- testing(churn_split) # Testing set
Let’s check the dimensions of our data to ensure a proper split:
dim(churn_train)
## [1] 8101 20
dim(churn_test)
## [1] 2026 20
Both splits contain the appropriate number of observations: \(8101\) is roughly \(80\%\) of our full data set, which contains \(10127\) observations.
How about we take it a step further, and let’s make sure the ratios between the attrited and existing customers are similar for our split.
(sum(churn_train$attrition_flag == 1)) / (sum(churn_train$attrition_flag == 0)) # Ratio between attrited and existing customers in the training set
## [1] 0.1913
(sum(churn_test$attrition_flag == 1)) / (sum(churn_test$attrition_flag == 0)) # Ratio between attrited and existing customers in the testing set
## [1] 0.1918
The ratios are almost the same between the training and testing set. That’s how we know our data was correctly split!
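The ratio we just computed can be wrapped in a small helper; a base-R sketch with hypothetical labels (the function name and toy vector are illustrative, not from the project code):

```r
# Ratio of attrited (1) to existing (0) customers
attrition_ratio <- function(y) sum(y == 1) / sum(y == 0)
y <- c(1, 0, 0, 0, 0)  # hypothetical labels: 1 attrited customer among 4 existing
attrition_ratio(y)      # 0.25
```

Stratified splitting is what keeps this ratio nearly identical across the training and testing sets.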
Here, we will be building two recipes; which one we apply will depend on the model. The only difference between our two recipes is that one contains principal component analysis (PCA) and the other does not. We’ll only use the PCA recipe for logistic regression and discriminant analysis, to avoid rank deficiency. The other models we’re building don’t require PCA, as they reduce dimensionality and deal with collinearity to a degree.
Looking back at our EDA, we will only include 14 of our 19 predictors in the recipe. We’ll be excluding the variables months_on_book, dependent_count, education_level, marital_status, and contacts_count_12_mon. In my recipes, I plan to upsample using SMOTE to deal with the imbalanced data. Details of my data processing are explained in the comments of the folded code below.
# Recipe building
# Recipe 1, with PCA
recipe_churn <- recipe(attrition_flag ~ customer_age + gender + income_category + total_relationship_count + months_inactive_12_mon + credit_limit + total_revolving_bal + avg_open_to_buy + total_amt_chng_q4_q1 + total_trans_amt + total_trans_ct + total_ct_chng_q4_q1 + avg_utilization_ratio + card_category, data = churn_train) %>%
# We only include 14 predictors
# Credit limit needed to be logged, as it had a right-skewed distribution
# The total transaction amount also needed to be logged to resemble a roughly normal distribution
# Dummy-encode our categorical variables using one-hot encoding
# Scale and center all predictors, as they are measured on different scales
# Impute income_category via k-nearest neighbors to remove all nulls
# Up-sample using Synthetic Minority Oversampling (SMOTE) to deal with imbalanced data
# SMOTE will balance our outcome variable by generating synthetic examples of the minority class (attrited customers), interpolating between minority observations and their nearest neighbors
# This dataset has realistic ratios, so we won't upsample too much; i.e., our minority level will be less than half as many rows as our majority level in SMOTE
# We will be applying PCA to our logistic and discriminant models to reduce dimensionality and to have a set of uncorrelated variables
# We have several strongly correlated variables in our data processing, so PCA will be useful for this reason too
# PCA will also help reduce dimensionality in our one-hot encoded categorical variables
step_impute_knn(income_category, impute_with = imp_vars(all_nominal_predictors())) %>%
step_log(credit_limit) %>%
step_log(total_trans_amt) %>%
step_dummy(gender, one_hot = TRUE) %>%
step_dummy(card_category, one_hot = TRUE) %>%
step_dummy(income_category, one_hot = TRUE) %>%
step_smote(attrition_flag, over_ratio = 0.4765) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors()) %>%
step_pca(all_predictors())
# Recipe 2, without PCA
# Same recipe, just without PCA
# This will be our main recipe, since the other will only be used for logistic and discriminant analysis
recipe_churn2 <- recipe(attrition_flag ~ customer_age + gender + income_category + total_relationship_count + months_inactive_12_mon + credit_limit + total_revolving_bal + avg_open_to_buy + total_amt_chng_q4_q1 + total_trans_amt + total_trans_ct + total_ct_chng_q4_q1 + avg_utilization_ratio + card_category, data = churn_train) %>%
step_impute_knn(income_category, impute_with = imp_vars(all_nominal_predictors())) %>%
step_log(credit_limit) %>%
step_log(total_trans_amt) %>%
step_dummy(gender, one_hot = TRUE) %>%
step_dummy(card_category, one_hot = TRUE) %>%
step_dummy(income_category, one_hot = TRUE) %>%
step_smote(attrition_flag, over_ratio = 0.4765) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors())
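To see why the PCA step helps with the strongly correlated predictors we found in our EDA, here is a minimal base-R sketch on hypothetical data (using prcomp directly rather than step_pca): the principal components of correlated variables are themselves uncorrelated.

```r
# Two strongly correlated hypothetical predictors
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
cor(x1, x2)               # close to 1
# Their principal component scores are uncorrelated by construction
pcs <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)$x
cor(pcs[, 1], pcs[, 2])   # effectively zero
```

This is the same property step_pca exploits: the logistic and discriminant models receive a set of uncorrelated inputs, avoiding rank deficiency.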
To deal with our imbalanced data, we’ll be using stratified cross
validation on our outcome variable, attrition_flag.
set.seed(1115)
churn_folds <- vfold_cv(churn_train, v = 10, strata = attrition_flag) # 10-fold CV
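Under the hood, v-fold CV partitions the rows into v roughly equal groups and holds each group out once. A base-R sketch of the fold assignment that vfold_cv automates (without the stratification, and with hypothetical sizes):

```r
# Hand-rolled 10-fold assignment for a hypothetical data set of 20 rows
set.seed(1115)
n <- 20
fold_id <- sample(rep(1:10, length.out = n))  # shuffle a balanced fold label for each row
table(fold_id)                                # each fold holds out n/10 = 2 rows
```

Stratifying (as vfold_cv does with strata = attrition_flag) additionally keeps the attrited/existing ratio similar within every fold.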
Finally, we reach the most crucial part of our project, which is the model building. I will be trying several machine learning methods using the recipes we built. Afterwards, I will pick the model with the highest roc_auc and evaluate it on the testing set. I will fit the following models, performing hyperparameter tuning on some of them.
The full code for these models can be viewed in my R scripts. Here, I’ll just load the results I retrieved from my models, and show a few code snippets to follow along with the model-building process.
load("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/Log_Discrim_Results.rda")
load("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/Ridge_Results.rda")
load("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/RF.rda")
load("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/XGBoost_Results.rda")
load("/Users/reynaldoperez/Downloads/Customer_Attrition_Proj/SVM_models.rda")
We will start off with the simpler classification models. Here, I plan on implementing a logistic regression model, along with models for linear and quadratic discriminant analysis. We will create a workflow and add the recipe with PCA. I then plan on fitting each of these models to our folded data, churn_folds.
Let’s start off with a simple logistic regression model. Here, we create a specification of the model and apply it to a workflow, along with our recipe, which contains the data processing. We then fit the model to the training set. Finally, we calculate the roc_auc, the area under the receiver operating characteristic curve. Keep in mind, this is our metric for all our models.
# Specify a logistic regression model
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Create a workflow
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(recipe_churn)
# Fit the model by applying the workflow to the training set
log_fit <- fit(log_wkflow, churn_train)
# ROC_AUC of model
log_reg_acc <- augment(log_fit, new_data = churn_train) %>%
roc_auc(attrition_flag, estimate = .pred_0)
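For intuition, roc_auc estimates the probability that a randomly chosen member of the event class gets a higher score than a randomly chosen member of the other class. A base-R sketch via the rank (Mann-Whitney) formula, on hypothetical scores and labels (the function name is illustrative, not from yardstick):

```r
# AUC = P(score of a random positive > score of a random negative)
auc_by_hand <- function(scores, labels) {
  r <- rank(scores)                        # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_by_hand(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfectly separated classes give 1
```

A value of 0.5 corresponds to random guessing, which is why we compare all our models against this scale.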
We also follow these steps when applying models for linear and quadratic discriminant analysis.
### Linear Discriminant Analysis
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
# Workflow
lda_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(recipe_churn)
# Fit LDA to training data
lda_fit <- fit(lda_wkflow, churn_train)
# ROC AUC of model
lda_acc <- augment(lda_fit, new_data = churn_train) %>%
roc_auc(attrition_flag, estimate = .pred_0)
### Quadratic Discriminant Analysis
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
# Workflow
qda_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(recipe_churn)
# Fit QDA to training data
qda_fit <- fit(qda_wkflow, churn_train)
# ROC AUC of model
qda_acc <- augment(qda_fit, new_data = churn_train) %>%
roc_auc(attrition_flag, estimate = .pred_0)
Now, we will compare the performances of our models.
rocs <- c(log_reg_acc$.estimate, lda_acc$.estimate, qda_acc$.estimate)
models <- c("Logistic Regression", "LDA", "QDA")
results <- tibble(rocs = rocs, models = models)
results %>%
arrange(-rocs)
Our quadratic discriminant model did the best out of the three. Let’s now fit it to the testing data.
# Reuse the QDA workflow already fitted on the training set; refitting on the test set would leak information
qda_test <- qda_fit
predict(qda_test, new_data = churn_test, type = "prob") %>%
bind_cols(churn_test %>% dplyr::select(attrition_flag)) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
What if we applied resamples to our models? By fitting all three models to the resamples, we see that the logistic model is the best performing of the three (see R script for details). Let’s evaluate this model on the testing set.
# Reuse the logistic workflow already fitted on the training set; refitting on the test set would leak information
log_test <- log_fit
predict(log_test, new_data = churn_test, type = "prob") %>%
bind_cols(churn_test %>% dplyr::select(attrition_flag)) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
We get an roc_auc of \(0.8248\), which is not much different from our training-set performance. Let’s now assess our logistic regression model by looking at its confusion matrix and ROC curve, using the testing set.
augment(log_test, new_data = churn_test) %>%
conf_mat(truth = attrition_flag, estimate = .pred_class) %>%
autoplot(type = "heatmap")
augment(log_test, new_data = churn_test) %>%
roc_curve(attrition_flag, .pred_0) %>%
autoplot()
For churn prediction, this is a surprisingly good model. Based on the confusion matrix, our model is relatively accurate at predicting both existing and churned customers, since the Type I and Type II errors are considerably small. The ROC curve also looks great, as it bows up toward the top-left corner of the graph. Overall, the logistic regression model is our best model thus far.
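The counts behind such a heatmap can be reproduced with a plain cross-tabulation; a base-R sketch on a handful of hypothetical predictions (not our actual test-set results):

```r
# Hypothetical truth and predicted classes, coded like our attrition_flag
truth <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
pred  <- factor(c(0, 0, 1, 0, 0, 1), levels = c(0, 1))
table(Predicted = pred, Truth = truth)
# The off-diagonal cells are the Type I (false positive) and Type II (false negative) errors
```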
Now, we will apply a regularization model: ridge regression. To build this model, we specify a logistic regression model and set mixture = 0, which corresponds to pure ridge regularization. Furthermore, we will tune the penalty hyperparameter, which represents the total amount of regularization. We add the recipe without PCA, as we will for the rest of our models from here on. Let’s see the plot of our tuned model.
autoplot(tune_ridge)
It seems that the amount of regularization has no effect on the performance metrics until it hits a certain value, where there’s a huge drop. Past that point, the penalty is so strong that the coefficient estimates are shrunk too far to retain any meaningful predictive influence.
We can also visualize how the magnitudes of the coefficients are regularized toward zero as the penalty increases.
ridge_final_fit %>%
extract_fit_engine() %>%
plot(xvar = "lambda")
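The coefficient paths above have a simple closed-form analogue: in the special case of an orthonormal design, the ridge estimate is the OLS estimate scaled by \(1/(1+\lambda)\). A hypothetical numeric sketch of that shrinkage:

```r
# Shrinkage intuition: ridge scales an OLS coefficient by 1 / (1 + lambda) (orthonormal case)
beta_ols <- 2.5                    # hypothetical unpenalized coefficient
lambda   <- c(0, 1, 10, 100)       # increasing amounts of regularization
round(beta_ols / (1 + lambda), 3)  # 2.500 1.250 0.227 0.025 -> shrinks toward zero
```

This matches the plot: every coefficient path decays smoothly toward zero as the penalty grows, without ever being set exactly to zero (that behavior belongs to the lasso).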
Now, let’s apply our model to the testing set.
augment(ridge_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
We seem to have obtained a better estimate than with our logistic regression model. We’re making progress, but let’s try fitting a few more models.
For our decision tree, we tuned the cost_complexity
parameter. Let’s take a look at our graph.
autoplot(tune_res)
Observe that past a certain value of cost_complexity, the performance of our decision tree drops dramatically. We will fit the model to our testing set, as we will with all our models.
augment(class_tree_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
Now, let’s take a look at a random forest model. Just a reminder, we’re fitting the second recipe (excluding PCA). For this model, we tune the following parameters in the ranger engine:
- mtry: the number of predictors randomly sampled at each split
- min_n: the minimum number of data points in a node required for the node to be split further
We used the default number of trees provided by R, which is 500. I planned on tuning trees, but I am unfortunately limited by my computational power. Still, this model revealed great results.
autoplot(tune_forest)
Observe that the roc_auc value increases as the number of randomly selected predictors increases. Looking back, however, it seems unnecessary to have given the min_n tuning parameter such a large range.
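As a reminder of how the forest turns its 500 trees into one prediction, each tree casts a vote and the majority wins. A base-R sketch with hypothetical per-tree votes (not output from our fitted model):

```r
# 500 hypothetical tree votes, each slightly favoring class 1 (attrited)
set.seed(7)
tree_votes <- sample(c(0, 1), size = 500, replace = TRUE, prob = c(0.3, 0.7))
mean(tree_votes)                    # share of trees voting "attrited"
as.integer(mean(tree_votes) > 0.5)  # ensemble prediction: the majority class
```

Averaging many noisy trees like this is what makes the forest more stable than any single decision tree.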
What variables have a large impact on whether a customer will churn from the bank? We can examine this with a variable importance plot (VIP).
class_forest_final_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#2D7496", color = "black"))
Notice that the predictors total_trans_ct and total_trans_amt have the largest impact on our outcome variable, attrition_flag, according to the random forest. In general, the amount customers purchase greatly affects whether or not they will churn. With this feature importance, perhaps the bank can find a way to mitigate customer attrition.
We also notice that, of the models we’ve fitted so far, the random forest has the highest roc_auc:
augment(class_forest_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
Similar to the random forest, I set up a boosted tree model with the following tuning parameters:
- mtry: the number of predictors randomly sampled at each split
- min_n: the minimum number of data points in a node required for the node to be split further
- learn_rate: the rate at which the boosting algorithm adapts from iteration to iteration
Just like in our random forest model, trees is set to a default of 500.
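The learn_rate intuition can be seen in one dimension: each boosting iteration fits the current residual, and the prediction moves only a fraction eta of the way toward the target. A hypothetical base-R sketch (the numbers are illustrative, not from our model):

```r
target <- 10; pred <- 0; eta <- 0.1  # hypothetical target, starting prediction, learning rate
for (i in 1:50) {
  residual <- target - pred          # what the next weak learner would fit
  pred <- pred + eta * residual      # take a small, shrunken step
}
round(pred, 3)  # approaches 10 after many small steps
```

A smaller eta needs more iterations but typically generalizes better, which is why learn_rate and the number of trees trade off against each other.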
autoplot(tune_boost)
Just like in our random forest model, performance increases as the number of randomly selected predictors increases. Also note that the model performs worse as the learning rate goes up.
Following a similar process to our random forest model, we will view a VIP for our boosted tree model.
class_boost_final_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#2D7496", color = "black"))
Comparing the VIPs of the random forest and boosted tree models, we observe that the predictors total_trans_ct, total_trans_amt, and total_revolving_bal are the top three most impactful variables in both. However, the boosted model’s VIP includes different lower-importance variables than the random forest’s, and the ordering changes. For example, the boosted tree model considers total_relationship_count more important than total_ct_chng_q4_q1, whereas the random forest model considers the opposite.
Now, let’s fit the boosted tree model to the testing set.
augment(class_boost_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
Although close, our random forest model remains our most accurate model thus far.
For our final machine learning model, we will fit a support vector machine to our data. As with our previous models, we will perform hyperparameter tuning to help improve performance.
For our SVM, we will use a linear kernel, which seeks the separating boundary that maximizes the margin between our classes. We will tune the cost parameter, which is the cost of predicting a sample on the wrong side of the margin.
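The cost parameter scales the penalty the SVM pays for margin violations, which are measured by the hinge loss. A base-R sketch with hypothetical labels y in {-1, 1} and decision values f (the helper is illustrative, not part of the SVM library):

```r
# Hinge loss: zero when a point is correctly classified outside the margin
hinge <- function(y, f) pmax(0, 1 - y * f)
hinge(c(1, 1, -1), c(2.0, 0.5, -0.2))  # 0.0 0.5 0.8
```

The first point is safely outside the margin (no penalty); the second is inside it; the third sits near the boundary, so it incurs the largest penalty. A higher cost weights these violations more heavily, producing a narrower margin.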
autoplot(tune_svm_lin)
Based on the graph, it doesn’t seem like this model will give us a higher accuracy than our previous models. Nonetheless, let’s fit it to the testing set.
augment(svm_linear_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0)
Although it is a decent estimate, it is definitely not one of the best models we have applied.
Now that we have analyzed the results of our models and fit them to the testing set, it’s time to choose our candidate model! Let’s show the performance of our models in a single plot. We’ll choose the best model based on roc_auc.
log_auc <- augment(log_test, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
lda_auc <- lda_acc$.estimate
qda_auc <- qda_acc$.estimate
ridge_auc <- augment(ridge_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
dec_auc <- augment(class_tree_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
rf_auc <- augment(class_forest_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
boost_auc <- augment(class_boost_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
svm_auc <- augment(svm_linear_final_fit, new_data = churn_test) %>%
roc_auc(truth = attrition_flag, estimate = .pred_0) %>%
dplyr::select(.estimate)
churn_aucs <- c(log_auc$.estimate, lda_auc, qda_auc, ridge_auc$.estimate, dec_auc$.estimate, rf_auc$.estimate, boost_auc$.estimate, svm_auc$.estimate)
churn_models <- c("Logistic Regression", "LDA", "QDA", "Ridge", "Decision Tree", "Random Forest", "Boosted Tree", "SVM")
churn_res <- tibble(Model = churn_models, ROC_AUC = churn_aucs)
churn_res <- churn_res %>%
arrange(desc(ROC_AUC)) # Sort by the tibble's own column, not the external vector
churn_res
Let’s now create a visualization of these results.
ggplot(churn_res,
aes(x = Model, y = ROC_AUC)) +
geom_bar(stat = "identity", width = 0.2, fill = "#4BBC94", color = "black") +
labs(title = "Model Performance") +
theme_minimal()
We see that our random forest model performed the best!
Out of all the machine learning models we’ve applied, the random forest model performed the best.
show_best(tune_forest, metric = "roc_auc") %>%
dplyr::select(-.estimator) %>%
dplyr::slice(1)
In summary, we managed to achieve high accuracy in our model despite our data being imbalanced. Of course, in the real world, machine learning models that predict churn won’t achieve accuracy as high as we managed to obtain. It is actually much harder to model churn realistically; however, this project gave me the chance to experience what that’s like, as well as learn various ways to deal with imbalanced data.
In our models, we applied a method called the synthetic minority oversampling technique, otherwise known as SMOTE, in order to balance our classes. This method allowed us to oversample the minority class. However, given more time, I would have also tried undersampling in my models to discover which gives the better results.
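For the record, SMOTE does more than duplicate rows: each synthetic minority example is an interpolation between a minority observation and one of its nearest minority neighbors. A base-R sketch of that idea with two hypothetical points:

```r
set.seed(42)
x_i  <- c(1.0, 2.0)             # a minority-class observation (2 features)
x_nn <- c(2.0, 4.0)             # one of its nearest minority neighbors
u <- runif(1)                    # random interpolation weight in (0, 1)
x_new <- x_i + u * (x_nn - x_i)  # synthetic point on the segment between them
x_new
```

Because the new point lies between real minority observations rather than on top of one, the oversampled class is smoothed out instead of simply repeated, which reduces overfitting relative to plain duplication.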
With the exploratory data analysis we performed and the machine learning methods we applied, the bank’s credit card division now has an idea of which variables are driving customers to leave its services. This way, it can develop strategies to prevent attrition.
Overall, I’ve learned quite a lot from this project, and I hope to take on more data to apply my machine learning knowledge to!